Guidelines for Preserving New Forms of Scholarship

Guidelines

If a publication platform integrates third party applications for features such as annotations or comments, the publisher should ensure that the terms of service for that application provide appropriate permission for preserving and migrating that content over time. For example, Hypothesis’ Terms of Service specify that the copyright of annotation data is CC0.

See also:
14. Avoid being dependent on third party services for core features
15. Plan a strategy for preservation when third party dependencies exist

Thinking through the best ways to present and preserve media assets such as video early in the publication cycle allows lead time to implement best practices for preservation, such as procuring or licensing media for local hosting or exclusively for preservation, or choosing remote services better suited to web harvesting.

Move supporting files such as multimedia, fonts, JavaScript, and CSS, local to the publication or inside the application used for publishing. This helps ensure the vital components of the work can be easily packaged together, reduces ongoing maintenance, and helps ensure exports contain all necessary resources.

If this is impractical in the live environment, other guidelines may be relevant:
15. Develop a strategy to capture any external media content
16. Captions for non-text features add meaningful context
20. Ensure all core intellectual components of a work are reflected in the export package
29. Consider a preservation-specific EPUB in your workflow
51. Host media files local to the website
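
As a rough illustration of the guideline on keeping supporting files local, the sketch below scans a page for fonts, scripts, stylesheets, and media that are loaded from another host. The host name `press.example.edu`, the sample markup, and the list of tag/attribute pairs are assumptions made for this example; a real audit would also need to look inside CSS files and scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag/attribute pairs that commonly pull in supporting files (illustrative, not exhaustive).
RESOURCE_ATTRS = {
    "script": "src",
    "link": "href",
    "img": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
    "iframe": "src",
}


class ExternalResourceFinder(HTMLParser):
    """Collect resource URLs that point outside the publication's own host."""

    def __init__(self, local_host):
        super().__init__()
        self.local_host = local_host
        self.external = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        url = dict(attrs).get(wanted)
        if not url:
            return
        host = urlparse(url).netloc
        # Relative URLs have no host and are treated as local.
        if host and host != self.local_host:
            self.external.append((tag, url))


def find_external_resources(html, local_host):
    """Return (tag, url) pairs for resources hosted somewhere other than local_host."""
    finder = ExternalResourceFinder(local_host)
    finder.feed(html)
    return finder.external


# Hypothetical page: two local files and two externally hosted ones.
sample = """
<html><head>
<link rel="stylesheet" href="css/site.css">
<link rel="stylesheet" href="https://fonts.example.net/serif.css">
<script src="https://cdn.example.org/lib.js"></script>
</head><body>
<img src="/images/figure1.png">
</body></html>
"""

external = find_external_resources(sample, "press.example.edu")
for tag, url in external:
    print(tag, url)
```

Each reported URL is a candidate for copying into the publication itself, or for one of the fallback strategies listed above if local hosting is impractical.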

Sometimes it is necessary or preferable to reference or embed third-party media content that is outside of the control of the publisher but integral to the understanding of the work. For these features, anticipate that their availability may be temporary and make plans to ensure that they are not only preserved, but sustained in some form within the publication while they are on the publisher platform. In the case of an embedded YouTube video, for example, some options to support preservation might include: retaining or requesting a copy of the video file; getting permission to copy the content with the youtube-dl tool in order to bring it into the local publication; or archiving and linking to a copy on the Internet Archive. An informative caption can help support future readers if the content is unavailable.

These guidelines may also improve preservability of third party hosted media:
12. Start discussions about multimedia early in the project
14. Avoid externally hosted media
16. Captions for non-text features add meaningful context
20. Ensure all core intellectual components of a work are reflected in the export package

Embedded enhanced features, especially those that link to resources outside of the publication or use an unusual format, are at the highest risk of failing in the future. For this reason, a meaningful caption is vital for providing clues to future readers about what they should expect to find in that location in the text, and preferably some means of finding and accessing it. Ideally, this caption would include a title, source, unique persistent identifier (e.g. DOI, ARK ID, or Handle), and a link to an archived copy if different from the identifier. Though any link could ultimately fail, this information would at least provide clues to where the user might find an archived copy. When creating captions, apply the standards available within the format you are using to support automated parsing. For example, HTML5 has the <figure> and <figcaption> elements. “Alt” attributes are also widely used to supply context if a feature cannot be viewed. In this respect, a meaningful caption may also meet standards for digital accessibility.
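
One way to apply this consistently is to generate captions from the same metadata every time. The following is a minimal sketch, not a prescribed template: the function name `build_figure`, the caption wording, and the example title, DOI, and archive URL are all hypothetical placeholders.

```python
from html import escape


def build_figure(media_html, title, source, identifier_url, archived_url=None):
    """Wrap an embedded feature in <figure> with a descriptive <figcaption>."""
    parts = [
        f"<cite>{escape(title)}</cite>",
        f"Source: {escape(source)}",
        f'<a href="{escape(identifier_url)}">{escape(identifier_url)}</a>',
    ]
    # Only add the archived link when it differs from the persistent identifier.
    if archived_url and archived_url != identifier_url:
        parts.append(
            f'Archived copy: <a href="{escape(archived_url)}">{escape(archived_url)}</a>'
        )
    caption = ". ".join(parts) + "."
    return f"<figure>\n  {media_html}\n  <figcaption>{caption}</figcaption>\n</figure>"


# Hypothetical values; substitute the real title, source, and registered identifier.
figure_html = build_figure(
    '<video controls src="media/interview.mp4"></video>',
    title="Interview with the author",
    source="Example University Press",
    identifier_url="https://doi.org/10.0000/example",
    archived_url="https://archive.example.org/interview",
)
print(figure_html)
```

Because every caption shares the same structure, a preservation workflow can locate the title, source, and links with a simple parser rather than by reading free text.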

Where non-text features are supplied as separate publication resources, this guideline may also be relevant:
24. Create metadata for each publication resource

Some platforms support assigning each publication resource its own descriptive metadata and landing page, making it possible to cite them independently of the text as a whole. In these cases, if the publisher has the capacity to assign unique persistent identifiers such as valid DOIs, ARK IDs, or Handles to each publication resource and to provide these as part of the metadata, this can help maintain connections between the components of a publication and sustain citation links. As an example, consider the case where a video is embedded in an EPUB and it has a caption under it that includes a registered DOI. The DOI points to a page dedicated to the published video. If the publisher no longer has that material, a preservation service may have the option to register the location of its preservation copy with doi.org so that the link would point to a new location. If a resource is local to the publication and is not intended to be cited or described independently, then a meaningful caption provides useful context, but creating persistent identifiers isn’t necessary.

These guidelines also relate to the use of identifiers:
17. Use persistent identifiers to link or cite external resources
24. Create descriptive metadata for each publication resource, include identifiers
31. Assign persistent identifiers to significant versions

Where there is a strong justification for using remote resources or non-core media, EPUB supports a fallback mechanism that lets a reading system display an alternative, supported resource in its place. Use this mechanism in these instances.

These guidelines may also be relevant when considering use of non-core media types:
29. Consider a preservation-specific version of the EPUB
41. Harvesting the content of iframes may have unpredictable outcomes

If externally linked web content must be visually embedded in an EPUB, recognize that it is at very high risk for loss. If the content cannot be moved inside the EPUB container using supported features, this material should have an informative caption and be described clearly in the structural metadata within the EPUB. Specifically, the package’s manifest metadata should have an item that: (a) specifies the resource URL, (b) lists “remote-resources” as a property, and (c) defines a fallback item. If the embedded web content is not supplied to the preservation service, but can be successfully harvested, this additional metadata could facilitate a preservation workflow to identify and capture these features using an appropriate harvesting tool. If, for example, a visually embedded Google Trends chart no longer displays active content in the future, an archived web page with this chart could be accessed instead. This content should be noted consistently and documented as part of the publication that needs to be preserved. In general, any consistency that makes it easy to automatically identify the visually embedded web-based features within the text increases the chance of designing a scalable workflow to manage them.
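
To show how this manifest metadata could support an automated workflow, here is a small sketch that lists remote manifest items and reports whether each declares the property and a fallback. The sample package document, item ids, and URLs are invented for illustration and simply mirror the three-part description above; the exact placement of the “remote-resources” property should be checked against the EPUB specification.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document fragment; ids and URLs are placeholders.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="trends" href="https://example.org/trends/chart" media-type="text/html"
          properties="remote-resources" fallback="trends-img"/>
    <item id="trends-img" href="images/trends.png" media-type="image/png"/>
    <item id="map" href="https://example.org/map" media-type="text/html"/>
  </manifest>
</package>"""


def audit_remote_items(opf_xml):
    """Return one record per manifest item whose href is an http(s) URL."""
    root = ET.fromstring(opf_xml)
    items = {
        item.get("id"): item
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    report = []
    for item_id, item in items.items():
        href = item.get("href", "")
        if not href.startswith(("http://", "https://")):
            continue  # resource lives inside the EPUB container
        fallback_id = item.get("fallback")
        has_fallback = bool(fallback_id) and fallback_id in items
        report.append({
            "id": item_id,
            "url": href,
            "declares_remote": "remote-resources" in item.get("properties", "").split(),
            "fallback_href": items[fallback_id].get("href") if has_fallback else None,
        })
    return report


report = audit_remote_items(SAMPLE_OPF)
for record in report:
    print(record)
```

In this sample, the `map` item would be flagged because it has neither the property nor a fallback, while the `trends` item gives a harvesting tool both a URL to capture and a local image to fall back on.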

These guidelines may also be relevant to embedding web content in an EPUB:
16. Captions for non-text features add meaningful context
40. Indicate the license status of resource in the HTML around the object
41. Use HTML iframes with caution
42. Facilitate a local web archive workflow for iframe content

Iframe, short for “inline frame,” is an HTML tag that can be used to embed the content from any URL inside an HTML-based document such as an EPUB or webpage. Some publishers may use an iframe to embed things like YouTube videos or advanced media players into an EPUB. It is more sustainable to use the HTML <video> or <audio> elements when embedding audio or video. EPUB3 readers are not required to support iframes. If used, the content may not render in all EPUB3 readers and is at a high risk of loss through link rot.

These guidelines may also be relevant to embedding media in EPUBs:
12. Start discussions around multimedia early in the process
14. Avoid external dependencies in general
34. Opt for core media types when embedding multimedia in an EPUB

A preservation service may not collect web content outside of the agreed-upon domain names unless copyright for the content being harvested is clear. If third-party pages and features that are visually embedded in an EPUB or a web-based publication are meant to be preserved, it should be possible to identify which content publishers have the right to collect so that a web crawler can be configured to include or exclude it. One way to differentiate could be to consistently express the rights in the metadata that is supplied to the preservation service. Another option is to apply structured metadata describing the rights status to the HTML. The Creative Commons REL documentation includes examples of this that cover both page- and object-level licenses; this approach could support automated harvesting decisions at either level. Alternatively, a publisher could supply a list of domain names to include for harvest during the initial preservation workflow configuration.
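
The sketch below illustrates how license markup in the HTML could feed an automated harvesting decision. It looks only for links marked with rel="license"; the policy of accepting any Creative Commons URL, the function name `may_harvest`, and the sample pages are assumptions for this example, and a real workflow would apply whatever rules the publisher and preservation service agree on.

```python
from html.parser import HTMLParser

# Hypothetical policy: harvest content carrying any Creative Commons license or dedication.
ALLOWED_LICENSE_PREFIXES = (
    "https://creativecommons.org/licenses/",
    "https://creativecommons.org/publicdomain/",
)


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr.get("href"):
            self.licenses.append(attr["href"])


def may_harvest(html, allowed_prefixes=ALLOWED_LICENSE_PREFIXES):
    """Return True if the page declares at least one license matching the policy."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return any(lic.startswith(allowed_prefixes) for lic in parser.licenses)


licensed_page = (
    '<p>Chart by A. Author, '
    '<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></p>'
)
unlicensed_page = '<p>Chart by A. Author. <a href="https://example.org/terms">Terms</a></p>'

print(may_harvest(licensed_page))
print(may_harvest(unlicensed_page))
```

A page with no recognizable license statement is treated as out of scope here, which matches the cautious default described above.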

These guidelines may also be useful to consider when embedding external web content:
25. Add license information to resource-level metadata
38. List the URLs for external web content in the metadata
45. Embed metadata that includes a license in the <head> of a web page

An HTML iframe can contain many types of content from many sources, which makes iframes a challenge for preservation. The quality of automated website archiving in general can vary greatly. If an iframe is embedded in an EPUB or website, the more inconsistent, complex, and dynamic its content, the more likely that content will be lost in an automated process. If these features are important to preserve, consider a manual process to capture and package the intellectual components of the iframe content in another form. For example, a video or screenshot with a caption that links to the website might be a sufficient fallback for conveying the contents of the iframe.

These guidelines may also be relevant to use of iframes:
38. List the URLs for each embedded iframe in the metadata
39. Avoid use of iframes in EPUBs
42. Facilitate a local web archiving workflow to support iframes

Preservation services might not support a workflow that automatically harvests the content of iframes embedded within an EPUB. Even with such a service, the quality could vary greatly, and the content might change following publication. If fallback options are not sufficient, a more stable approach would be for the publisher to create an archived copy of the web page featured in the iframe. While there are tools that can be run locally by the publisher to perform single-page archiving, there are also third-party archiving services such as archive.today or Internet Archive’s Save Page Now service that allow you to archive a single page before publication and generate a persistent link for the embedded web content. This link could be included in a descriptive caption under the embedded feature. Publishers should test the outcome of these single-page captures, as quality can vary depending on the complexity of the website and the harvest method applied.
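
As a hedged illustration of the archived-link approach, the sketch below builds the two Wayback Machine URLs a publisher would typically work with: one that requests a capture and one that replays a capture made at a known time. It makes no network request. The URL prefixes reflect the Wayback Machine's public URL patterns as commonly used and should be confirmed before relying on them; the page URL and capture time are placeholders.

```python
from datetime import datetime, timezone

# Assumed Wayback Machine URL patterns; confirm against current Internet Archive documentation.
SAVE_PREFIX = "https://web.archive.org/save/"
REPLAY_PREFIX = "https://web.archive.org/web/"


def save_request_url(page_url):
    """URL that asks Save Page Now to capture page_url when visited."""
    return SAVE_PREFIX + page_url


def replay_url(page_url, captured_at):
    """URL that replays the capture closest to captured_at (a timezone-aware datetime)."""
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{REPLAY_PREFIX}{stamp}/{page_url}"


def caption_line(title, page_url, captured_at):
    """Plain-text caption giving both the live and the archived location."""
    return (
        f"{title}. Live: {page_url}. "
        f"Archived: {replay_url(page_url, captured_at)}"
    )


# Hypothetical embedded page and capture time.
embedded = "https://example.org/chart"
captured = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)

print(save_request_url(embedded))
print(caption_line("Interactive chart", embedded, captured))
```

After requesting a capture, open the replay link and compare it with the live page, since the capture may be missing dynamic content.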

These guidelines may also be relevant:
14. Avoid dependence on externally hosted platforms for core features
38. Avoid the use of iframes to embed multimedia

Linking to media that is hosted on YouTube or Vimeo is a threat to platform and content longevity, especially for media that is owned or managed by third parties. To mitigate future link rot and the general instability of archiving streamed content, where appropriate (technically and legally), host a local copy of each media asset and embed it in the web page using standard HTML5 media tags. To keep the overall size of embedded media manageable for access and for web archiving, it may be advantageous to embed lower-quality copies of the media and link to higher-resolution versions via persistent links such as DOIs.

See also:
12. Start discussions about multimedia features early
14. Avoid depending on externally hosted web services

In order to improve the likelihood that content published to the web will be able to be captured via web archiving methods, developers could preload any content that would otherwise depend on user interactions. For example, rather than repeatedly making small API calls as the user interacts with a feature, if the dataset that supports the feature is small enough, load the data as a JSON file when the page loads so that further server calls are not necessary.
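
The preloading idea can be sketched as a build step that writes a small dataset directly into the page, so the feature reads its data from the document instead of calling an API. This is one possible approach, not the only one: the function name `embed_dataset`, the element id `feature-data`, and the sample data are hypothetical, and the page's own script would still need to read the embedded JSON.

```python
import json


def embed_dataset(html, dataset, element_id="feature-data"):
    """Insert dataset as an inline JSON <script> block just before </body>."""
    # Escape "</" so the data cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the data round-trips unchanged.
    payload = json.dumps(dataset, ensure_ascii=False).replace("</", "<\\/")
    tag = f'<script type="application/json" id="{element_id}">{payload}</script>'
    if "</body>" not in html:
        raise ValueError("page has no closing </body> tag")
    return html.replace("</body>", tag + "\n</body>", 1)


# Hypothetical page and dataset for a small map feature.
page_template = '<html><body><div id="map"></div></body></html>'
places = [
    {"name": "Site A", "lat": 51.5, "lon": -0.12},
    {"name": "Site B", "lat": 48.85, "lon": 2.35},
]

page = embed_dataset(page_template, places)
print(page)
```

Because the data travels with the HTML, a web crawler that captures the page also captures everything the feature needs, with no further server calls to discover.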

This guideline describes another approach:
50. Consider a “progressive enhancement” design to support a scriptless environment

Avoid using the “embed” option to insert a social media post into your publication. This can be unstable for preservation and for long-term sustainability since posts or accounts may be deleted. If the social media post is integral to the work, consider first taking a screenshot that can be embedded into the publication as an image. Underneath, a caption should indicate the origin of the post. Finally, use a web archive service such as archive.today or Internet Archive’s Save Page Now service to create a copy of the post—be sure to test the results, since archiving social media posts can be unreliable. The two links (live and archived) could be referenced as a citation or footnote depending on local practices.

These guidelines are also relevant to embedding social media posts in a publication:
8. Ensure terms of service cover preservation of data in third-party services
14. Avoid depending on third party services for core intellectual components
55. Consider ethical implications of embedding social media posts

Some publications, especially in a web environment, may include social media posts or user-contributed content that are automatically included in the archive package, especially with a web harvesting approach. Before implementing these features or including them in publications, consider whether taking copies of them infringes on individual rights or safety. Preservation services may not be able to evaluate specific situations in a scalable way, and so it’s important to avoid including these in the preservation scope if there is uncertainty around them. This may involve designing a website so that certain content can easily be excluded, e.g. by keeping this content at a separate URL that can be skipped during crawling. The Documenting the Now project website includes information about ethical collection of social media content.
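
Keeping sensitive content at a separate URL makes the exclusion easy to express as a crawl rule. The sketch below shows the idea as a simple scope check; the host name, the `/comments/` and `/social/` path prefixes, and the function name are hypothetical, and real crawlers have their own configuration formats for the same purpose.

```python
from urllib.parse import urlparse

# Hypothetical scope: one publication host, with user-contributed areas excluded.
ALLOWED_HOSTS = {"press.example.edu"}
EXCLUDED_PATH_PREFIXES = ("/comments/", "/social/")


def in_preservation_scope(url, allowed_hosts=ALLOWED_HOSTS,
                          excluded_path_prefixes=EXCLUDED_PATH_PREFIXES):
    """Return True if url is on an allowed host and not under an excluded path."""
    parts = urlparse(url)
    if parts.hostname not in allowed_hosts:
        return False
    path = parts.path or "/"
    return not any(path.startswith(prefix) for prefix in excluded_path_prefixes)


candidates = [
    "https://press.example.edu/book/chapter-1",
    "https://press.example.edu/comments/chapter-1",
    "https://social.example.com/post/123",
]
in_scope = [url for url in candidates if in_preservation_scope(url)]
print(in_scope)
```

The check only works if the site design keeps the excluded content under predictable paths, which is why this decision is best made before the features are built.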

These guidelines discuss legal and technical considerations for preservation:
8. Ensure terms of service cover preservation of data in third-party services
54. Avoid embedding social media posts in a publication

Dynamic maps, such as those generated with Google Maps, consist of many smaller map tiles that are loaded on the fly as users pan and zoom. Web crawlers cannot easily capture this experience, nor can it be exported. If the map is not the focal point of the work and is being used to present a small number of locations, consider using one or more still images. Display the place name and coordinates for the pin in the caption and provide a link to a live map.

These guidelines offer alternative ways to manage dynamic map features:
16. Captions add important context to non-text features
53. Consider web page designs that pre-load all data when the page loads

Some web-based features require communication with a server that is driven by unpredictable user interaction, or they use an open-ended number of URLs to retrieve the data that supports the feature. These features cannot be exported easily due to their dependence on a live website and cannot be captured well using web archiving, which depends on identifying every unique URL. Examples include: dynamic maps (e.g. Google Maps), full text or faceted search, web forms, data visualizations (e.g. ArcGIS), IIIF image viewers, and streamed content. Some features can be redesigned to remove their dependency on a live server, but if they can’t, publishers will need to consider what can be preserved. There are many strategies for this, for example: create a simpler static version of the feature that incorporates the key features for the purpose of preservation; embed a local copy of a server-based resource rather than depend on a third-party service; supply code or data for the feature with documentation for re-assembling the functionality; record a video of the interaction as it behaves in the published environment for future playback; or a combination of these.

These guidelines offer alternative ways to manage features that depend on a live server:
16. Captions add important context to non-text features
53. Consider web page designs that pre-load all data when the page loads
63. Supply raw data, documentation for data visualizations

Data visualizations tend to be a particular arrangement of one or more raw datasets. Data visualization formats can obscure parts of the underlying data that they are derived from. They may also be compiled or complex. All of these properties could potentially make the data difficult to open, validate, or comprehend in the future. To preserve a publication in which data visualizations are core intellectual components, request underlying raw data from the author. Request supporting documentation that would enable a future reader to retrace the author's steps from the raw data to the visualization. Images or videos of the visualization may also be helpful for recreating it. For both visualization and raw data formats, as with all supplements, ideally the files will be in an open or broadly adopted format. The Library of Congress Recommended Formats Statement can help with selecting formats. In the case of vector data, for example, there is not a broadly adopted open format, but Shapefile, while proprietary, is broadly adopted and openly documented. A variety of tools can read Shapefiles, which increases the likelihood that the format will continue to be supported in some form.

These guidelines may also be relevant when considering preservation of data visualizations:
11. Use non-proprietary, broadly supported and adopted open file formats
57. Use alternative approaches for features that require communication with a server
64. Use meaningful file names and field names in your data, supply documentation